Skip to content

Conversation

@andygrove
Copy link
Member

@andygrove andygrove commented Dec 18, 2025

Which issue does this PR close?

N/A

Rationale for this change

from_json is a widely used Spark SQL function.

What changes are included in this PR?

  • Add very basic support for from_json for structs and a handful of primitive types.
  • Add microbenchmarks

What changes are NOT included in this PR?

  • Support for arrays and maps
  • Support for edge cases (numerics in single or double quotes, for example)
  • Fuzz testing

How are these changes tested?

New tests.

Benchmark Results

OpenJDK 64-Bit Server VM 11.0.22+7-LTS on Mac OS X 14.6.1
Apple M3 Max
from_json - simple primitives:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
SQL Parquet - Spark                                5304           5356          73          0.2        5058.6       1.0X
SQL Parquet - Comet (Scan)                         5456           5484          40          0.2        5203.2       1.0X
SQL Parquet - Comet (Scan, Exec)                    310            322          13          3.4         295.3      17.1X

OpenJDK 64-Bit Server VM 11.0.22+7-LTS on Mac OS X 14.6.1
Apple M3 Max
from_json - all primitive types:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
SQL Parquet - Spark                                1716           1729          19          0.6        1636.3       1.0X
SQL Parquet - Comet (Scan)                         1713           1719           8          0.6        1633.9       1.0X
SQL Parquet - Comet (Scan, Exec)                    776            777           1          1.4         740.0       2.2X

OpenJDK 64-Bit Server VM 11.0.22+7-LTS on Mac OS X 14.6.1
Apple M3 Max
from_json - with nulls:                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
SQL Parquet - Spark                                3293           3315          31          0.3        3140.9       1.0X
SQL Parquet - Comet (Scan)                         3322           3362          56          0.3        3168.3       1.0X
SQL Parquet - Comet (Scan, Exec)                    264            297          35          4.0         252.0      12.5X

OpenJDK 64-Bit Server VM 11.0.22+7-LTS on Mac OS X 14.6.1
Apple M3 Max
from_json - nested struct:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
SQL Parquet - Spark                                6901           6908          11          0.2        6580.9       1.0X
SQL Parquet - Comet (Scan)                         6873           6925          73          0.2        6554.5       1.0X
SQL Parquet - Comet (Scan, Exec)                    421            457          34          2.5         401.8      16.4X

OpenJDK 64-Bit Server VM 11.0.22+7-LTS on Mac OS X 14.6.1
Apple M3 Max
from_json - field access:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
SQL Parquet - Spark                                5555           5599          61          0.2        5298.0       1.0X
SQL Parquet - Comet (Scan)                         5530           5559          41          0.2        5274.0       1.0X
SQL Parquet - Comet (Scan, Exec)                    255            292          36          4.1         242.9      21.8X

@codecov-commenter
Copy link

codecov-commenter commented Dec 18, 2025

Codecov Report

❌ Patch coverage is 44.11765% with 19 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.60%. Comparing base (f09f8af) to head (db738e1).
⚠️ Report is 782 commits behind head on main.

Files with missing lines Patch % Lines
...rc/main/scala/org/apache/comet/serde/structs.scala 42.42% 13 Missing and 6 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2934      +/-   ##
============================================
+ Coverage     56.12%   59.60%   +3.47%     
- Complexity      976     1381     +405     
============================================
  Files           119      167      +48     
  Lines         11743    15493    +3750     
  Branches       2251     2568     +317     
============================================
+ Hits           6591     9234    +2643     
- Misses         4012     4959     +947     
- Partials       1140     1300     +160     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@andygrove andygrove changed the title feat: Add partial support for from_json [WIP] feat: Add partial support for from_json Dec 18, 2025
@andygrove andygrove marked this pull request as ready for review December 18, 2025 17:09
@coderfender
Copy link
Contributor

coderfender commented Dec 18, 2025

This is awesome ! I believe we have an unwanted cargo.lock file in the PR @andygrove

@andygrove
Copy link
Member Author

This is awesome ! I believe we have an unwanted cargo.lock file in the PR @andygrove

The cargo.lock file has been updated due to the new dependency on serde_json.

Copy link
Contributor

@mbutrovich mbutrovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly comment nits. Thanks @andygrove!

Copy link
Contributor

@mbutrovich mbutrovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @andygrove! This will be a great expression to support.

@comphead
Copy link
Contributor

Thanks @andygrove great start, it doesn't support nested structs?

@andygrove
Copy link
Member Author

Thanks @andygrove great start, it doesn't support nested structs?

It does support nested structs. See test("from_json - nested struct").

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also have a test with valid json and wrong schema?

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @andygrove 💪

@andygrove andygrove merged commit fd53edb into apache:main Dec 20, 2025
310 of 315 checks passed
@andygrove andygrove deleted the from-json branch December 20, 2025 15:38
@andygrove
Copy link
Member Author

Thanks for the reviews @mbutrovich and @comphead. I went ahead and merged despite the Spark 4 / hive-1 test failure which is happening on many PRs now. I filed #2946 to get this resolved. The test is not related to from_json support.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants